By definition of this mixture distribution, we have \[
\begin{aligned}
&p(Z_i=j)=p_j, \,j=1,2,\quad Z_i\,\,\text{i.i.d}\\
&X_i|Z_i=1 \sim N(\mu_1, \Sigma_1),\quad X_i|Z_i=2 \sim N(\mu_2,\Sigma_2), \quad X_i \,\,\text{i.i.d}\\
\end{aligned}
\] It’s worth noting that \(Z_i\) takes only two values and \(p_1+p_2=1\), so there is a single free parameter, and it is more convenient to express the distribution as a Bernoulli, i.e. \[
C_i=(Z_i-1)\mid p_2 \stackrel{\text{i.i.d}}{\sim} \text{Bern}(p_2), \text{ i.e. } p(C_i=1)=p_2
\] So, in the following, we work with \(C_i, p_2\), which is equivalent to working with \(Z_i, p_1, p_2\).
We need to specify priors for \(p_2, \mu_1, \Sigma_1,\mu_2,\Sigma_2\). These parameters have a natural structure: we assume prior independence between the mixing proportion and the two components, \[
p(p_2, \mu_1, \Sigma_1, \mu_2, \Sigma_2) = p(p_2)\cdot p(\mu_1,\Sigma_1)\cdot p(\mu_2,\Sigma_2)
\] This assumption is natural: in a mixture model, the parameters of different components are not a priori related to each other, nor to the mixing proportion that governs the class labels.
For simplicity, we use a Beta distribution for \(p_2\) and Jeffreys’ priors for \((\mu_1,\Sigma_1)\) and \((\mu_2,\Sigma_2)\). That is \[
\begin{aligned}
&p_2 \sim \text{Beta}(\alpha, \beta) \\
&p(\mu_1,\Sigma_1) \propto |\Sigma_1|^{-5/2} \\
&p(\mu_2,\Sigma_2) \propto |\Sigma_2|^{-5/2} \\
\end{aligned}
\] For the hyperparameters \((\alpha,\beta)\), we have no extra information about the clusters, so we simply choose \(\alpha=\beta=1\), i.e. a uniform prior on \(p_2\).
We draw posterior samples of \((p_2, \mu_1, \Sigma_1, \mu_2, \Sigma_2, C_1,\cdots,C_n)\). The values of \(p_1, Z_1,\cdots,Z_n\) are then recovered via \[
p_1=1-p_2,\,\,Z_i = C_i+1
\]
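For reference, these priors lead to the following full conditionals, which are exactly the updates used in the Gibbs sampler below (a routine conjugacy calculation; the Inv-Wishart degrees of freedom \(n_k+1\) come from the \(|\Sigma_k|^{-5/2}\) prior). Here \(n_2=\sum_i C_i\), \(n_1=n-n_2\), \(\bar{X}_k\) is the sample mean of the observations currently assigned to component \(k\), and \(S_k=\sum_{i:Z_i=k}(X_i-\mu_k)(X_i-\mu_k)^T\): \[
\begin{aligned}
&p_2 \mid C_{1:n} \sim \text{Beta}(n_2+\alpha,\, n_1+\beta) = \text{Beta}(n_2+1,\, n_1+1)\\
&\mu_k \mid \Sigma_k, C_{1:n}, X_{1:n} \sim N\!\left(\bar{X}_k,\, \Sigma_k/n_k\right)\\
&\Sigma_k \mid \mu_k, C_{1:n}, X_{1:n} \sim \text{Inv-Wishart}_{n_k+1}\!\left(S_k^{-1}\right)\\
&p(C_i=1 \mid \text{rest}) \propto p_2\, N(X_i \mid \mu_2, \Sigma_2),\qquad p(C_i=0 \mid \text{rest}) \propto (1-p_2)\, N(X_i \mid \mu_1, \Sigma_1)
\end{aligned}
\]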
library(MCMCpack, quietly=T)
library(mvtnorm)
set.seed(102932)
mixgauss <- read.table("./mixgauss.dat", header=FALSE)
n <- nrow(mixgauss)
p <- 3

# we need initial values for p_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2, C_1,...,C_n
p2.0 <- 0.5
Ci.0 <- sample(0:1, n, replace=TRUE)  # 0: class 1, 1: class 2
mu1.0 <- colMeans(mixgauss[Ci.0==0, ])
mu2.0 <- colMeans(mixgauss[Ci.0==1, ])
Sigma1.0 <- cov(mixgauss[Ci.0==0, ])
Sigma2.0 <- cov(mixgauss[Ci.0==1, ])

# we sample in the order p_2, \mu_1, \Sigma_1, \mu_2, \Sigma_2, C_1,...,C_n
#' Gibbs sampler for p_2, \mu_1, \Sigma_1, \mu_2, \Sigma_2, C_1,...,C_n
#' @param ... sample of step t-1
#' @return sample of step t
Gibbs <- function(p2, mu1, mu2, Sigma1, Sigma2, Ci){
  n2 <- sum(Ci)
  n1 <- n - n2
  # sample p_2 | C ~ Beta(n2+1, n1+1)
  p2.t <- rbeta(1, n2+1, n1+1)
  # sample mu_k | Sigma_k, C ~ N(component sample mean, Sigma_k/n_k)
  mu1.t <- c(mvtnorm::rmvnorm(1, colMeans(mixgauss[Ci==0, ]), Sigma1/n1))
  mu2.t <- c(mvtnorm::rmvnorm(1, colMeans(mixgauss[Ci==1, ]), Sigma2/n2))
  # sample Sigma_k | mu_k, C ~ Inv-Wishart(n_k+1, S_k)
  X1 <- t(mixgauss[Ci==0, ])
  X2 <- t(mixgauss[Ci==1, ])
  S1 <- tcrossprod(X1 - mu1.t)
  S2 <- tcrossprod(X2 - mu2.t)
  Sigma1.t <- MCMCpack::riwish(n1+1, S1)
  Sigma2.t <- MCMCpack::riwish(n2+1, S2)
  # unnormalized probabilities of C_i = 0 and C_i = 1
  pCi.0 <- dmvnorm(mixgauss, mean=mu1.t, sigma=Sigma1.t) * (1-p2.t)
  pCi.1 <- dmvnorm(mixgauss, mean=mu2.t, sigma=Sigma2.t) * p2.t
  Ci.t <- mapply(function(p0, p1) sample(0:1, 1, prob=c(p0, p1)), pCi.0, pCi.1)
  return(list(p2=p2.t, mu1=mu1.t, mu2=mu2.t,
              Sigma1=Sigma1.t, Sigma2=Sigma2.t, Ci=Ci.t))
}

# sampling
samp.size <- 10000
samples.out <- vector("list", samp.size + 1)
samples.out[[1]] <- list(p2=p2.0, mu1=mu1.0, mu2=mu2.0,
                         Sigma1=Sigma1.0, Sigma2=Sigma2.0, Ci=Ci.0)
for (i in 2:(samp.size+1)){
  samples.out[[i]] <- do.call(Gibbs, args=samples.out[[i-1]])
}

# To get the final cluster labels Z_1,...,Z_n and p_1, p_2, just use
samples.out <- lapply(samples.out, function(sublist) {
  sublist$p1 <- 1 - sublist$p2
  sublist$Zi <- sublist$Ci + 1
  return(sublist)
})

# one example
samples.out[[samp.size+1]]
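The plot below colors each point by an estimated cluster label Zi.hat.vi and marks the estimated centers mu1.hat and mu2.hat, which are computed elsewhere in the analysis. As a rough sketch of how analogous point estimates could be obtained from the Gibbs output above (the burn-in length and the names ending in .gibbs are illustrative assumptions, not the original computation):

# Illustrative post-processing of the Gibbs output (hypothetical names and burn-in)
burn.in <- 1000                                     # assumed burn-in length
keep <- samples.out[(burn.in + 1):(samp.size + 1)]
mu1.gibbs <- Reduce(`+`, lapply(keep, `[[`, "mu1")) / length(keep)  # posterior mean of mu_1
mu2.gibbs <- Reduce(`+`, lapply(keep, `[[`, "mu2")) / length(keep)  # posterior mean of mu_2
Zi.mat <- sapply(keep, `[[`, "Zi")                  # n x (number of kept draws)
Zi.gibbs <- apply(Zi.mat, 1, function(z) as.integer(names(which.max(table(z)))))  # majority vote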
library(plotly)
# Create a 3D scatter plot with different colors for each class
# (Zi.hat.vi, mu1.hat, mu2.hat are the estimated labels and centers obtained earlier)
scatter_plot <- plot_ly(data = mixgauss,
                        x = ~mixgauss[, 1], y = ~mixgauss[, 2], z = ~mixgauss[, 3],
                        type = "scatter3d", color = ~factor(Zi.hat.vi),
                        mode = "markers", opacity = 0.7)
# Add the cluster centers to the scatter plot
center <- data.frame(rbind(mu1.hat, mu2.hat))
colnames(center) <- c("x", "y", "z")
center$class <- c("c1", "c2")
final_plot <- scatter_plot %>%
  add_trace(data = center, x = ~x, y = ~y, z = ~z, color = ~class,
            mode = "markers", type = "scatter3d", opacity = 1,
            marker = list(size = 10, symbol = "square"))
# Show the final plot
final_plot
We mark the cluster centers with square symbols. However, judging from the plot, it may not be appropriate to assume that the data come from 2 classes.
Problem 4
(a)
We consider the normal model for multiple observations \(y_1,\dots,y_n\) with mean \(\mu\) and variance \(\sigma^2\). The likelihood is \[
\begin{aligned}
p(\mathbf{y}|\mu, \sigma^2) &= \prod_i p(y_i|\mu, \sigma^2) \\
&= \prod_i \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{(y_i-\mu)^2}{2\sigma^2} \right]
\end{aligned}
\] so \[
\begin{aligned}
\log p(\mathbf{y}|\mu, \sigma^2) &= - \sum_i \left[
\log \sqrt{2\pi \sigma^2} + \frac{(y_i-\mu)^2}{2\sigma^2} \right] \\
&=\text{const} - \frac{n}{2}\log \sigma^2 - \frac{(n-1) s_y^2+n(\bar{y}-\mu)^2}{2 \sigma^2}
\end{aligned}
\] where \(s_y^2=\sum_i(y_i-\bar{y})^2/(n-1)\), and we used the identity \(\sum_i(y_i-\mu)^2=(n-1)s_y^2+n(\bar{y}-\mu)^2\).
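As a quick numerical sanity check of this decomposition (a minimal sketch with made-up numbers):

# Check that sum_i (y_i - mu)^2 = (n-1) s_y^2 + n (ybar - mu)^2
set.seed(1)
y <- rnorm(10); mu <- 0.3; n <- length(y)
c(sum((y - mu)^2), (n - 1) * var(y) + n * (mean(y) - mu)^2)  # the two values agree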
Therefore, we have \[
\begin{aligned}
&\frac{\partial \log p(\mathbf{y}|\mu, \sigma^2)}{\partial \mu} = \frac{n(\bar{y}-\mu)}{\sigma^2}\\
&\frac{\partial \log p(\mathbf{y}|\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2}+\frac{(n-1) s_y^2+n(\bar{y}-\mu)^2}{2(\sigma^2)^2}
\end{aligned}
\] and then \[
\begin{aligned}
&\frac{\partial^2 \log p(\mathbf{y}|\mu, \sigma^2)}{\partial \mu^2} = - \frac{n}{\sigma^2}\\
&\frac{\partial^2 \log p(\mathbf{y}|\mu, \sigma^2)}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{(n-1) s_y^2+n(\bar{y}-\mu)^2}{\sigma^6}\\
&\frac{\partial^2 \log p(\mathbf{y}|\mu, \sigma^2)}{\partial \mu\, \partial \sigma^2} = \frac{\partial^2 \log p(\mathbf{y}|\mu, \sigma^2)}{\partial \sigma^2\, \partial \mu} = - \frac{n(\bar{y}-\mu)}{\sigma^4}
\end{aligned}
\] Therefore, the Fisher Information is \[
\begin{aligned}
I(\mu, \sigma^2) &= -\mathbb{E}\begin{bmatrix}
- \frac{n}{\sigma^2} & - \frac{n(\bar{y}-\mu)}{\sigma^4}\\
- \frac{n(\bar{y}-\mu)}{\sigma^4} & \frac{n}{2\sigma^4} - \frac{(n-1) s_y^2+n(\bar{y}-\mu)^2}{\sigma^6}
\end{bmatrix} \\
&= -\begin{bmatrix}
- \frac{n}{\sigma^2} & 0 \\
0 & \frac{n}{2\sigma^4} - \frac{(n-1) \sigma^2+\sigma^2}{\sigma^6}
\end{bmatrix}\\
&=\begin{bmatrix}
\frac{n}{\sigma^2} & 0 \\
0 & \frac{n}{2\sigma^4}
\end{bmatrix}
\end{aligned}
\] in which we use \[
\mathbb{E}(\bar{y}-\mu)=0, \quad \mathbb{E}(\bar{y}-\mu)^2 =\frac{\sigma^2}{n}, \quad \mathbb{E}\,s_y^2=\sigma^2, \quad\text{so that}\quad n\,\mathbb{E}(\bar{y}-\mu)^2 = \sigma^2
\] Therefore, the Jeffreys’ prior is \[
p_J(\mu, \sigma^2) \propto \sqrt{\frac{n^2}{2\sigma^6}}
\propto (\sigma^2)^{-3/2}
\]
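As an optional numerical check of this Fisher information (a minimal sketch; it assumes the numDeriv package is available and uses arbitrary values of \(n\), \(\mu\), \(\sigma^2\)):

# Monte Carlo estimate of I(mu, sigma^2): average the negative Hessian of the
# log-likelihood over simulated data sets and compare with diag(n/sigma^2, n/(2 sigma^4))
library(numDeriv)
set.seed(1)
n <- 50; mu <- 1; sigma2 <- 2
loglik <- function(par, y) sum(dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
I.hat <- Reduce(`+`, lapply(1:2000, function(b) {
  y <- rnorm(n, mu, sqrt(sigma2))
  -numDeriv::hessian(loglik, c(mu, sigma2), y = y)
})) / 2000
I.hat   # should be close to diag(c(n / sigma2, n / (2 * sigma2^2)))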
(b)
The posterior distribution is \[
\begin{aligned}
p_J(\mu, \sigma^2|\mathbf{y}) &\propto p_J(\mu, \sigma^2) p(\mathbf{y}|\mu, \sigma^2) \\
& \propto (\sigma^2)^{-3/2} \prod_i \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[ -\frac{(y_i-\mu)^2}{2\sigma^2} \right] \\
& \propto \left(\sigma^2\right)^{-3 / 2} \cdot \left(\sigma^2\right)^{-n / 2} \exp \left[-\frac{(n-1) s_y^2+n(\bar{y}-\mu)^2}{2 \sigma^2}\right] \\
& \propto \sigma^{-1} \cdot (\sigma^2)^{-(n/2+1)} \exp
\left[-\frac{1}{2\sigma^2} \left( n \cdot \frac{\sum_i(y_i-\bar{y})^2}{n} +n(\bar{y}-\mu)^2 \right) \right]
\end{aligned}
\] This expression is the kernel of a normal-inverse-\(\chi^2\) distribution; formally (using the notation of the book Bayesian Data Analysis, third edition), \[
p_J(\mu, \sigma^2|\mathbf{y}) \sim \text{N-Inv-}\chi^2 \left(\bar{y}, \frac{\sum_i(y_i-\bar{y})^2}{n^2}; n, \frac{\sum_i(y_i-\bar{y})^2}{n} \right)
\] To see this more clearly, we can rewrite \[
p_J(\mu|\sigma^2, \mathbf{y}) \propto \sigma^{-1} \exp
\left[-\frac{1}{2\sigma^2/n} \left(\mu - \bar{y} \right)^2 \right] \sim N\left(\bar{y}, \frac{\sigma^2}{\kappa_n}\right), \kappa_n=n
\]\[
\begin{aligned}
p_J(\sigma^2|\mathbf{y}) &\propto (\sigma^2)^{-(n/2+1)} \exp
\left[-\frac{1}{2\sigma^2} \left( n \cdot \frac{\sum_i(y_i-\bar{y})^2}{n} \right) \right] \\
& \propto (\sigma^2)^{-(\nu_n/2+1)} \exp
\left[-\frac{1}{2\sigma^2} \left( \nu_n \cdot \sigma_n^2 \right) \right] \\
& \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2),\quad \nu_n=n,
\sigma_n^2 = \frac{\sum_i(y_i-\bar{y})^2}{n}
\end{aligned}
\] Hence, the joint density \(p_J(\mu, \sigma^2|\mathbf{y})\) is a proper posterior density (provided the \(y_i\) are not all identical, so that \(\sum_i(y_i-\bar{y})^2>0\)), namely \[
\text{N-Inv-}\chi^2 \left(\bar{y}, \frac{\sum_i(y_i-\bar{y})^2}{n^2}; n, \frac{\sum_i(y_i-\bar{y})^2}{n} \right)
\]
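As a small illustration (a sketch with simulated data, not part of the original solution), one can sample from this posterior by first drawing \(\sigma^2\) from its scaled inverse-\(\chi^2\) marginal and then \(\mu\) given \(\sigma^2\):

# Sketch: draws from the N-Inv-chi^2 posterior derived above (illustrative data)
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)
n <- length(y); ybar <- mean(y); sigma2.n <- sum((y - ybar)^2) / n
sigma2.draw <- n * sigma2.n / rchisq(5000, df = n)               # sigma^2 | y ~ Inv-chi^2(n, sigma2.n)
mu.draw <- rnorm(5000, mean = ybar, sd = sqrt(sigma2.draw / n))  # mu | sigma^2, y ~ N(ybar, sigma^2/n)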
(c)
We can write the prior on \((\theta, \Sigma)\) as \[
p_J(\theta, \Sigma) = C_1 |\Sigma|^{-(p+2)/2}
\] in which \(C_1\) is a constant.
Now, suppose for contradiction that \(p_J(\theta, \Sigma)\) were proper, that is \[
\int p_J(\theta, \Sigma) d\theta d\Sigma = \int C_1 |\Sigma|^{-(p+2)/2}d\theta d\Sigma = 1
\] By Fubini’s theorem (see also this post), if the joint density is proper then the marginal distribution of \(\Sigma\) must also be proper (a.s.). However, computing the marginal directly and noting that the support of \(\theta\) is \(\mathbb{R}^p\), \[
\begin{aligned}
p(\Sigma) = \int_{\mathbb{R}^p} p_J(\theta, \Sigma) d\theta &= C_1|\Sigma|^{-(p+2)/2} \int_{\mathbb{R}^p} 1\cdot d\theta \\
&= C_1|\Sigma|^{-(p+2)/2} \cdot \infty\\
&=\infty
\end{aligned}
\] The integral above diverges, so \(p(\Sigma)\) is not proper. This contradicts the assumption, so \(p_J(\theta, \Sigma)\) must be improper and cannot actually be a probability density for \((\theta, \Sigma)\).
(d)
Just do it.
Prior \[
p_J(\theta, \Sigma) \propto |\Sigma|^{-(p+2)/2}
\] Likelihood \[
\begin{aligned}
p(\mathbf{y}_1,\cdots,\mathbf{y}_n|\theta, \Sigma) &\propto |\Sigma|^{-n / 2} \exp \left[-\frac{1}{2} \sum_{i=1}^n\left(y_i-\theta\right)^T \Sigma^{-1}\left(y_i-\theta\right)\right], \quad y_i \in \mathbb{R}^p \\
& \propto |\Sigma|^{-n / 2} \exp \left[-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} S_0\right)\right],\quad S_0=\sum_{i=1}^n\left(y_i-\theta\right)\left(y_i-\theta\right)^T
\end{aligned}
\] Posterior \[
\begin{aligned}
p_J(\theta, \Sigma|\mathbf{y}_1,\cdots,\mathbf{y}_n) &\propto p_J(\theta, \Sigma) p(\mathbf{y}_1,\cdots,\mathbf{y}_n|\theta, \Sigma) \\
& \propto |\Sigma|^{-(n+p+2) / 2} \exp \left[-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} S_0\right)\right]
\end{aligned}
\] Note that \[
\begin{aligned}
\operatorname{tr}\left(\Sigma^{-1} S_0\right)&=\sum_{i=1}^n\left(y_i-\theta\right)^T \Sigma^{-1}\left(y_i-\theta\right) \\
& =\sum_{i=1}^n\left(y_i-\bar{y}+\bar{y}-\theta\right)^T \Sigma^{-1}\left(y_i-\bar{y}+\bar{y}-\theta\right) \\
& =\sum_{i=1}^n\left(y_i-\bar{y}\right)^T \Sigma^{-1}\left(y_i-\bar{y}\right)-2 \sum_{i=1}^n\left(y_i-\bar{y}\right)^T \Sigma^{-1}(\bar{y}-\theta)+n(\bar{y}-\theta)^T \Sigma^{-1}(\bar{y}-\theta) \\
& =\sum_{i=1}^n\left(y_i-\bar{y}\right)^T \Sigma^{-1}\left(y_i-\bar{y}\right)+n(\theta-\bar{y})^T \Sigma^{-1}(\theta-\bar{y}) \\
& =\operatorname{tr}\left(\Sigma^{-1} S\right)+n(\theta-\bar{y})^T \Sigma^{-1}(\theta-\bar{y}),
\quad S=\sum_{i=1}^n (y_i-\bar{y})(y_i-\bar{y})^T
\end{aligned}
\] therefore \[
\begin{aligned}
p_J(\theta, \Sigma|\mathbf{y}_1,\cdots,\mathbf{y}_n)
& \propto |\Sigma|^{-((n+p) / 2 + 1)} \exp \left[-\frac{1}{2} \left( \operatorname{tr}\left(\Sigma^{-1} S\right)+n(\theta-\bar{y})^T \Sigma^{-1}(\theta-\bar{y}) \right) \right]
\end{aligned}
\] This is the kernel of a normal-inverse-Wishart distribution; formally (using the notation of the book Bayesian Data Analysis, third edition), \[
p_J(\theta, \Sigma|\mathbf{y}_1,\cdots,\mathbf{y}_n)
\sim \text{Normal-Inverse-Wishart}\left(\bar{y},\frac{S}{n};n, S \right)
\] From this joint posterior we obtain \[
\begin{aligned}
p_J(\theta| \Sigma,\mathbf{y}_1,\cdots,\mathbf{y}_n)
& \propto |\Sigma/n|^{-1 / 2} \exp \left[ -\frac{1}{2}(\theta-\bar{y})^T \left(\frac{\Sigma}{n}\right)^{-1}(\theta-\bar{y}) \right] \sim N \left(\bar{y},\frac{\Sigma}{n} \right)
\end{aligned}
\] and, integrating \(\theta\) out of the joint posterior (which contributes a factor \(|\Sigma/n|^{1/2}\)), \[
p_J(\Sigma|\mathbf{y}_1,\cdots,\mathbf{y}_n)
\propto |\Sigma|^{-(n+p+1) / 2 } \exp \left[-\frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} S\right) \right] \sim \text{Inv-Wishart}_n(S^{-1})
\]